(3) Logs Analysis using Mining.

Data set type: Industrial-Anoki

💨🔥💨 Smoke Analysis

✅ Python, ✅ GitLab, ✅ MongoDB

Qs

Q1: What are the most common problems in pipelines?

Index

Nomenclature

- (STPS) Smoke test possible solution: the set of potential errors that could be avoided by using smoke tests.

Import Python libraries

Load variables to perform the analysis

References and Pages

Collect data from MongoDB

Read data from the MongoDB database

Analysis of data volumes.

Percentage of jobs by type

Number of failures by stage number

Calculate the similarity between texts and apply filters

Filter logs data

Get the text fragment containing the error

Remove a word if it does not exist

Test the filter that searches for the error text inside the logs

Apply the filter to all data

Remove stopwords

Exploratory analysis

In Python, one of the structures that most facilitates exploratory analysis is the Pandas DataFrame, which is where the information from the logs is now stored. However, tokenizing has introduced a major change. Before the text was split, the elements of study were the logs, and each one occupied a row, thus fulfilling the condition of tidy data: one observation, one row. After tokenization, the element of study has become each token (word), which violates the tidy data condition. To get back to the ideal structure, each token list has to be expanded, repeating the values of the other columns as many times as necessary. This process is known as expansion or unnesting.

Although it may seem an inefficient process (the number of rows increases considerably), this simple change makes operations such as grouping, counting, and plotting much easier.
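The expansion described above can be sketched with `DataFrame.explode`. This is only a minimal illustration: the column names (`log_id`, `project`, `tokens`) and the sample values are hypothetical and not taken from the real data set.

```python
import pandas as pd

# Hypothetical sample data: column names and tokens are illustrative only.
logs = pd.DataFrame({
    "log_id": [1, 2],
    "project": ["project_a", "project_b"],
    "tokens": [["error", "timeout", "error"], ["build", "failed"]],
})

# Expand each token list so that every row holds exactly one token;
# the values of the other columns are repeated once per token.
logs_tidy = (
    logs.explode("tokens")
        .rename(columns={"tokens": "token"})
        .reset_index(drop=True)
)

# With one token per row, grouping and counting become short expressions.
words_per_log = logs_tidy.groupby("log_id")["token"].count()
token_frequency = logs_tidy["token"].value_counts()

print(logs_tidy)
print(words_per_log)
print(token_frequency)
```

Once the data is in this one-token-per-row shape, the same `groupby` pattern can be applied to any other column (for example the hypothetical `project` column) without further restructuring.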

Total words used by each log event

Total words used by each project

Frequency of words

Create list of STPS (derivate 1)

Create list of STPS (derivate 2)